

UQE: A Query Engine for Unstructured Databases

Neural Information Processing Systems

Analytics on structured data is a mature field with many successful methods. However, most real-world data exists in unstructured form, such as images and conversations. We investigate the potential of Large Language Models (LLMs) to enable unstructured data analytics. In particular, we propose a new Universal Query Engine (UQE) that directly interrogates and draws insights from unstructured data collections. This engine accepts queries in a Universal Query Language (UQL), a dialect of SQL that provides full natural language flexibility in specifying conditions and operators. The new engine leverages the ability of LLMs to conduct analysis of unstructured data, while also allowing us to exploit advances in sampling and optimization techniques to achieve efficient and accurate query execution. In addition, we borrow techniques from classical compiler theory to better orchestrate the workflow between sampling methods and foundation model calls. We demonstrate the efficiency of UQE on data analytics across different modalities, including images, dialogs and reviews, and across a range of useful query types, including conditional aggregation, semantic retrieval and abstraction aggregation.
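To make the idea of a natural-language condition inside a SQL-style query concrete, here is a minimal, hedged sketch: the function names and the keyword-matching stand-in for the LLM call are illustrative assumptions, not UQE's actual implementation.

```python
# Hypothetical sketch of executing a UQL-style query such as
#   SELECT COUNT(*) FROM reviews WHERE "the review mentions a refund"
# The engine maps the natural-language condition over rows via an LLM
# call; here the LLM is mocked with keyword matching so the sketch runs.

def mock_llm_predicate(condition, row):
    # Stand-in for a real LLM judgment on unstructured text.
    return "refund" in row["text"].lower()

def uql_count_where(rows, condition, predicate=mock_llm_predicate):
    """COUNT(*) WHERE <natural-language condition> over unstructured rows."""
    return sum(1 for row in rows if predicate(condition, row))

reviews = [
    {"text": "Great product, fast shipping."},
    {"text": "Asked for a refund after it broke."},
    {"text": "Refund was processed quickly."},
]
n = uql_count_where(reviews, "the review mentions a refund")
print(n)  # 2
```

A real engine would replace the mock with batched foundation-model calls and, as the abstract notes, use sampling to avoid invoking the model on every row.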


LAFA: Agentic LLM-Driven Federated Analytics over Decentralized Data Sources

Ji, Haichao, Wang, Zibo, Pan, Cheng, Han, Meng, Zhu, Yifei, Wang, Dan, Han, Zhu

arXiv.org Artificial Intelligence

Large Language Models (LLMs) have shown great promise in automating data analytics tasks by interpreting natural language queries and generating multi-operation execution plans. However, existing LLM-agent-based analytics frameworks operate under the assumption of centralized data access, offering little to no privacy protection. In contrast, federated analytics (FA) enables privacy-preserving computation across distributed data sources, but lacks support for natural language input and requires structured, machine-readable queries. In this work, we present LAFA, the first system that integrates LLM-agent-based data analytics with FA. LAFA introduces a hierarchical multi-agent architecture that accepts natural language queries and transforms them into optimized, executable FA workflows. To improve execution efficiency, an optimizer agent rewrites and merges multiple DAGs, eliminating redundant operations and minimizing computational and communication overhead. Our experiments demonstrate that LAFA consistently outperforms baseline prompting strategies by achieving higher execution plan success rates and reducing resource-intensive FA operations by a substantial margin. This work establishes a practical foundation for privacy-preserving, LLM-driven analytics that supports natural language input in the FA setting. The rapid development of Large Language Models (LLMs) has offered unprecedented capabilities in natural language understanding, reasoning, and planning [1], significantly transforming the landscape of data analytics. LLMs can interpret complex analytical intents, generate structured code, and orchestrate multi-step tasks by interacting with external environments such as databases and computation sandboxes. These capabilities have led to the emergence of LLM-based agents that decompose high-level queries, plan analytical workflows, and execute or verify results through tool interactions.
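The optimizer-agent step of merging execution DAGs and eliminating redundant operations can be sketched as simple deduplication over shared subcomputations. The representation and operation names below are illustrative assumptions, not LAFA's actual data structures.

```python
# Hedged sketch: merge several execution plans (lists of (op, inputs)
# steps) so that an operation shared by multiple plans runs only once.
# "secure_sum" etc. are hypothetical operation names for illustration.

def merge_dags(dags):
    """Union several plans, deduplicating identical operations."""
    merged, seen = [], set()
    for dag in dags:
        for op, inputs in dag:
            key = (op, tuple(inputs))
            if key not in seen:
                seen.add(key)
                merged.append((op, inputs))
    return merged

plan_a = [("secure_sum", ["site1", "site2"]), ("mean", ["secure_sum"])]
plan_b = [("secure_sum", ["site1", "site2"]), ("variance", ["secure_sum"])]
merged = merge_dags([plan_a, plan_b])
print(len(merged))  # 3: the shared secure_sum runs once, feeding both aggregates
```

In a federated setting this matters because each eliminated operation saves a round of communication with the distributed data sources, which is exactly the overhead the abstract targets.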


DataPuzzle: Breaking Free from the Hallucinated Promise of LLMs in Data Analysis

Zhang, Zhengxuan, Liang, Zhuowen, Wu, Yin, Lin, Teng, Luo, Yuyu, Tang, Nan

arXiv.org Artificial Intelligence

Large language models (LLMs) are increasingly applied to multi-modal data analysis--not necessarily because they offer the most precise answers, but because they provide fluent, flexible interfaces for interpreting complex inputs. Yet this fluency often conceals a deeper structural failure: the prevailing "Prompt-to-Answer" paradigm treats LLMs as black-box analysts, collapsing evidence, reasoning, and conclusions into a single, opaque response. The result is brittle, unverifiable, and frequently misleading. We argue for a fundamental shift: from generation to structured extraction, from monolithic prompts to modular, agent-based workflows. LLMs should not serve as oracles, but as collaborators--specialized in tasks like extraction, translation, and linkage--embedded within transparent workflows that enable step-by-step reasoning and verification. We propose DataPuzzle, a conceptual multi-agent framework that decomposes complex questions, structures information into interpretable forms (e.g., tables, graphs), and coordinates agent roles to support transparent and verifiable analysis. This framework serves as an aspirational blueprint for restoring visibility and control in LLM-driven analytics--transforming opaque answers into traceable processes, and brittle fluency into accountable insight. This is not a marginal refinement; it is a call to reimagine how we build trustworthy, auditable analytic systems in the era of large language models. Structure is not a constraint--it is the path to clarity.


EPIC: Generative AI Platform for Accelerating HPC Operational Data Analytics

Karimi, Ahmad Maroof, Shin, Woong, Hines, Jesse, Ghosal, Tirthankar, Sattar, Naw Safrin, Wang, Feiyi

arXiv.org Artificial Intelligence

We present EPIC, an AI-driven platform designed to augment operational data analytics. EPIC employs a hierarchical multi-agent architecture where a top-level large language model provides query processing, reasoning and synthesis capabilities. These capabilities orchestrate three specialized low-level agents for information retrieval, descriptive analytics, and predictive analytics. This architecture enables EPIC to perform HPC operational analytics on multi-modal data, including text, images, and tabular formats, dynamically and iteratively. EPIC addresses the limitations of existing HPC operational analytics approaches, which rely on static methods that struggle to adapt to evolving analytics tasks and stakeholder demands. Through extensive evaluations on the Frontier HPC system, we demonstrate that EPIC effectively handles complex queries. Using descriptive analytics as a use case, fine-tuned smaller models outperform large state-of-the-art foundation models, achieving up to 26% higher accuracy. Additionally, we achieved 19x savings in LLM operational costs compared to proprietary solutions by employing a hybrid approach that combines large foundational models with fine-tuned local open-weight models.
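The cost savings from the hybrid approach come from routing routine queries to a cheap fine-tuned local model and reserving the large foundation model for open-ended reasoning. A minimal sketch of that routing idea follows; the task names and the rule itself are assumptions for illustration, not EPIC's actual policy.

```python
# Illustrative routing sketch: descriptive-analytics tasks go to a
# fine-tuned local open-weight model; everything else goes to the
# large foundation model. Task labels are hypothetical.

LOCAL_TASKS = {"describe", "summarize_metrics", "plot"}

def route(task_type):
    return "local-finetuned" if task_type in LOCAL_TASKS else "foundation-llm"

print(route("describe"))    # local-finetuned
print(route("root_cause"))  # foundation-llm
```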


Deep opacity and AI: A threat to XAI and to privacy protection mechanisms

Müller, Vincent C.

arXiv.org Artificial Intelligence

It is known that big data analytics and AI pose a threat to privacy, and that some of this is due to some kind of "black box problem" in AI. I explain how this becomes a problem in the context of justification for judgments and actions. Furthermore, I suggest distinguishing three kinds of opacity: 1) the subjects do not know what the system does ("shallow opacity"), 2) the analysts do not know what the system does ("standard black box opacity"), or 3) the analysts cannot possibly know what the system might do ("deep opacity"). If the agents, data subjects as well as analytics experts, operate under opacity, then these agents cannot provide justifications for judgments that are necessary to protect privacy, e.g., they cannot give "informed consent", or guarantee "anonymity". It follows from these points that agents in big data analytics and AI often cannot make the judgments needed to protect privacy. So I conclude that big data analytics makes the privacy problems worse and the remedies less effective. As a positive note, I provide a brief outlook on technical ways to handle this situation.


Track Component Failure Detection Using Data Analytics over existing STDS Track Circuit data

López, Francisco, Di Santi, Eduardo, Lefebvre, Clément, Mijatovic, Nenad, Pugnaloni, Michele, Martín, Victor, Saiah, Kenza

arXiv.org Machine Learning

A track circuit is an electrical system that detects the presence of a train on the tracks by passing a current through the rails, which act as conductors. In its initial form, track circuits consisted of a battery and a relay with adjustable resistors to set the transmitted signal gain and receiver operating point. Sections of track are electrically isolated by insulated joints in each rail. The transmitted signal travels through a single rail to the relay at the opposite end, then returns to the transmitter through the other rail. Track circuits follow the closed-loop principle, which means that any failure results in the safest state (track occupied), as shown in Figure 1 (track circuit behaviour schema). Because of this, track circuits also provide detection of broken rails. Nowadays, there are many types of track circuits. The latest state-of-the-art ones provide enhanced performance, integrating sophisticated signalling systems to improve operation and safety. Track-circuit failures have an important impact as they imply a stop of operations and an economic impact for both the railway operator and its customers (1).
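The closed-loop principle described above can be captured in a few lines: the relay is energized only when current completes the loop, so a train shunting the circuit, a broken rail, or a dead battery all collapse to the same safe state. This is a minimal pedagogical sketch, not a model of any real interlocking logic.

```python
# Fail-safe behaviour of a closed-loop track circuit: the relay drops
# (reporting "occupied") on ANY interruption of the current path.

def track_state(battery_ok, rails_intact, train_present):
    relay_energized = battery_ok and rails_intact and not train_present
    return "clear" if relay_energized else "occupied"

print(track_state(True, True, False))   # clear
print(track_state(True, True, True))    # occupied (train shunts the circuit)
print(track_state(True, False, False))  # occupied (broken rail -> safe state)
```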


Farm-LightSeek: An Edge-centric Multimodal Agricultural IoT Data Analytics Framework with Lightweight LLMs

Jiang, Dawen, Shen, Zhishu, Zheng, Qiushi, Zhang, Tiehua, Xiang, Wei, Jin, Jiong

arXiv.org Artificial Intelligence

Amid the challenges posed by global population growth and climate change, traditional agricultural Internet of Things (IoT) systems are currently undergoing a significant digital transformation to facilitate efficient big data processing. While smart agriculture utilizes artificial intelligence (AI) technologies to enable precise control, it still encounters significant challenges, including excessive reliance on agricultural expert knowledge, difficulties in fusing multimodal data, poor adaptability to dynamic environments, and bottlenecks in real-time decision-making at the edge. Large language models (LLMs), with their exceptional capabilities in knowledge acquisition and semantic understanding, provide a promising solution to address these challenges. To this end, we propose Farm-LightSeek, an edge-centric multimodal agricultural IoT data analytics framework that integrates LLMs with edge computing. This framework collects real-time farmland multi-source data (images, weather, geographic information) via sensors, performs cross-modal reasoning and disease detection at edge nodes, conducts low-latency management decisions, and enables cloud collaboration for model updates. The main innovations of Farm-LightSeek include: (1) an agricultural "perception-decision-action" closed-loop architecture; (2) cross-modal adaptive monitoring; and (3) a lightweight LLM deployment strategy balancing performance and efficiency. Experiments conducted on two real-world datasets demonstrate that Farm-LightSeek consistently achieves reliable performance in mission-critical tasks, even under the limitations of edge computing resources. This work advances intelligent real-time agricultural solutions and highlights the potential for deeper integration of agricultural IoT with LLMs.




Highly Efficient Direct Analytics on Semantic-aware Time Series Data Compression

Sun, Guoyou, Karras, Panagiotis, Zhang, Qi

arXiv.org Artificial Intelligence

Semantic communication has emerged as a promising paradigm to tackle the challenges of massive growing data traffic and sustainable data communication. It shifts the focus from data fidelity to goal-oriented or task-oriented semantic transmission. While deep learning-based methods are commonly used for semantic encoding and decoding, they struggle with the sequential nature of time series data and high computation cost, particularly in resource-constrained IoT environments. Data compression plays a crucial role in reducing transmission and storage costs, yet traditional data compression methods fall short of the demands of goal-oriented communication systems. In this paper, we propose a novel method for direct analytics on time series data compressed by the SHRINK compression algorithm. Through experimentation using outlier detection as a case study, we show that our method outperforms baselines running on uncompressed data in multiple cases, with merely 1% difference in the worst case. Additionally, it achieves four times lower runtime on average and accesses approximately 10% of the data volume, which enables edge analytics with limited storage and computation power. These results demonstrate that our approach offers reliable, high-speed outlier detection analytics for diverse IoT applications while extracting semantics from time-series data, achieving high compression, and reducing data transmission.
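The key idea of running analytics directly on the compressed form, without decompression, can be illustrated with a hedged sketch. We assume here (this is an illustration, not SHRINK's actual representation) that the compressor emits piecewise-linear segments; an outlier pass then inspects only the segment parameters, touching a small fraction of the original data volume.

```python
# Hedged sketch of direct analytics on a compressed time series.
# Assumption: compressed form is a list of (start, end, slope) segments.
# We flag segments whose slope deviates sharply from the average,
# without ever reconstructing the raw samples.

def segment_outliers(segments, thresh=3.0):
    slopes = [s for (_, _, s) in segments]
    mean = sum(slopes) / len(slopes)
    return [i for i, s in enumerate(slopes) if abs(s - mean) > thresh]

segments = [(0, 10, 0.1), (10, 20, 0.2), (20, 30, 5.0), (30, 40, 0.15)]
print(segment_outliers(segments))  # [2]
```

Operating on segment parameters rather than raw samples is what enables the roughly 10% data access and the runtime savings the abstract reports.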


Data Driven Decision Making with Time Series and Spatio-temporal Data

Yang, Bin, Liang, Yuxuan, Guo, Chenjuan, Jensen, Christian S.

arXiv.org Artificial Intelligence

Time series data captures properties that change over time. Such data occurs widely, ranging from the scientific and medical domains to the industrial and environmental domains. When the properties in time series exhibit spatial variations, we often call the data spatio-temporal. As part of the continued digitalization of processes throughout society, increasingly large volumes of time series and spatio-temporal data are available. In this tutorial, we focus on data-driven decision making with such data, e.g., enabling greener and more efficient transportation based on traffic time series forecasting. The tutorial adopts the holistic paradigm of "data-governance-analytics-decision." We first introduce the data foundation of time series and spatio-temporal data, which is often heterogeneous. Next, we discuss data governance methods that aim to improve data quality. We then cover data analytics, focusing on five desired characteristics: automation, robustness, generality, explainability, and resource efficiency. We finally cover data-driven decision making strategies and briefly discuss promising research directions. We hope that the tutorial will serve as a primary resource for researchers and practitioners who are interested in value creation from time series and spatio-temporal data.